May 10, 2021

Overview of presentation

  1. Introduction to COVID-19 World Vaccine Adverse Reactions Dataset

  2. Project workflow

  3. Project methods

    3.1 Overview of important packages and verbs used

    3.2 Challenges and solutions - Load, Clean and Augment

  4. Visualizations

  5. Modelling

  6. Conclusion and discussion

Introduction

COVID-19 World Vaccine Adverse Reactions

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

  • Data from the Vaccine Adverse Event Reporting System (VAERS) created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC)
  • Contains 3 data sets:
    1. PATIENTS.CSV
    2. VACCINES.CSV
    3. SYMPTOMS.CSV
  • Data sets connected by patient IDs (VAERS_ID)

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

PATIENTS.CSV: Contains information about the individuals that received the vaccines

## # A tibble: 34,121 x 35
##   VAERS_ID RECVDATE  STATE AGE_YRS CAGE_YR CAGE_MO SEX   RPT_DATE   SYMPTOM_TEXT
##   <chr>    <chr>     <chr>   <dbl>   <dbl>   <dbl> <chr> <date>     <chr>       
## 1 0916600  01/01/20… TX         33      33      NA F     NA         "Right side…
## 2 0916601  01/01/20… CA         73      73      NA F     NA         "Approximat…
## 3 0916602  01/01/20… WA         23      23      NA F     NA         "About 15 m…
## # … with 34,118 more rows, and 26 more variables: DIED <chr>, DATEDIED <chr>,
## #   L_THREAT <chr>, ER_VISIT <chr>, HOSPITAL <chr>, HOSPDAYS <dbl>,
## #   X_STAY <chr>, DISABLE <chr>, RECOVD <chr>, VAX_DATE <chr>,
## #   ONSET_DATE <chr>, NUMDAYS <dbl>, LAB_DATA <chr>, V_ADMINBY <chr>,
## #   V_FUNDBY <chr>, OTHER_MEDS <chr>, CUR_ILL <chr>, HISTORY <chr>,
## #   PRIOR_VAX <chr>, SPLTTYPE <chr>, FORM_VERS <dbl>, TODAYS_DATE <chr>,
## #   BIRTH_DEFECT <chr>, OFC_VISIT <chr>, ER_ED_VISIT <chr>, ALLERGIES <chr>

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

VACCINES.CSV: Contains information about the received vaccine

## # A tibble: 34,630 x 8
##    VAERS_ID VAX_TYPE VAX_MANU         VAX_LOT VAX_DOSE_SERIES VAX_ROUTE VAX_SITE
##    <chr>    <chr>    <chr>            <chr>   <chr>           <chr>     <chr>   
##  1 0916600  COVID19  "MODERNA"        037K20A 1               IM        LA      
##  2 0916601  COVID19  "MODERNA"        025L20A 1               IM        RA      
##  3 0916602  COVID19  "PFIZER\\BIONTE… EL1284  1               IM        LA      
##  4 0916603  COVID19  "MODERNA"        unknown <NA>            <NA>      <NA>    
##  5 0916604  COVID19  "MODERNA"        <NA>    1               IM        LA      
##  6 0916606  COVID19  "MODERNA"        011J20A 1               IM        LA      
##  7 0916607  COVID19  "MODERNA"        <NA>    <NA>            IM        LA      
##  8 0916608  COVID19  "MODERNA"        <NA>    1               IM        LA      
##  9 0916609  COVID19  "MODERNA"        011J20… 1               IM        LA      
## 10 0916610  COVID19  "MODERNA"        <NA>    1               SYR       LA      
## # … with 34,620 more rows, and 1 more variable: VAX_NAME <chr>

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

SYMPTOMS.CSV: Contains information about the symptoms experienced after vaccination

## # A tibble: 48,110 x 11
##   VAERS_ID SYMPTOM1     SYMPTOMVERSION1 SYMPTOM2     SYMPTOMVERSION2 SYMPTOM3   
##   <chr>    <chr>                  <dbl> <chr>                  <dbl> <chr>      
## 1 0916600  Dysphagia               23.1 Epiglottitis            23.1 <NA>       
## 2 0916601  Anxiety                 23.1 Dyspnoea                23.1 <NA>       
## 3 0916602  Chest disco…            23.1 Dysphagia               23.1 Pain in ex…
## 4 0916603  Dizziness               23.1 Fatigue                 23.1 Mobility d…
## 5 0916604  Injection s…            23.1 Injection s…            23.1 Injection …
## 6 0916606  Pharyngeal …            23.1 <NA>                    NA   <NA>       
## # … with 48,104 more rows, and 5 more variables: SYMPTOMVERSION3 <dbl>,
## #   SYMPTOM4 <chr>, SYMPTOMVERSION4 <dbl>, SYMPTOM5 <chr>,
## #   SYMPTOMVERSION5 <dbl>

Introduction: Aim

The aim of this project is to gain insight into the adverse effects of the different COVID-19 vaccines by answering the following questions:

  • Do some vaccines cause more/different symptoms than others?

  • Do patients with some profiles get more/different symptoms?

  • Are certain symptoms correlated with death?

  • Is patient profile correlated with death?

  • Does taking anti-inflammatories reduce the chance of having symptoms?

Methods

Methods: Project workflow

  1. Load data sets (patients, vaccines, symptoms)
  2. Clean each data set individually
  3. Augment and merge the data sets
  4. Make visualizations
  5. Do modelling

Methods: Important packages and verbs

Load and clean

  • readr: read_csv(), write_csv()
  • dplyr: filter(), select(), distinct(), mutate()
  • tidyr: replace_na()

Augment

  • dplyr: filter(), select(), mutate(), case_when(), arrange(), group_by(), count(), distinct(), summarise(), rename(), inner_join(), full_join()
  • tidyr: pivot_longer(), pivot_wider(), drop_na()
  • purrr: pluck()
  • stringr: regular expressions, str_c(), str_replace()

Analysis

  • ggplot2: geom_bar(), geom_boxplot(), geom_tile(), geom_segment(), theme_minimal()
  • forcats: fct_reorder()
  • scales
  • patchwork
  • viridis
  • stats: glm(), prcomp()
  • broom: tidy(), glance()
  • purrr: map()
  • tidyr: nest()

Methods: Dataset loading

Challenges and solutions

Patients, vaccines and symptoms datasets:

  • Multiple large files → keep them compressed as .gz files and decompress only when reading into R
  • Wrong column types assigned automatically on import → assign appropriate column types manually
  • NA strings ("NA", "N/A", "Unknown", " " …) → convert to NA when loading the data
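
A minimal sketch of the loading step described above, assuming readr is used as listed under "Load and clean". The temporary file, the three columns and the exact NA strings are illustrative stand-ins, not the project's real paths or full column specification.

```r
library(readr)

# Build a tiny compressed stand-in for PATIENTS.CSV so the sketch runs on its own.
# (The real file location is not given in the slides.)
demo_path <- tempfile(fileext = ".csv.gz")
write_csv(
  data.frame(
    VAERS_ID = c("0916600", "0916601", "0916602"),
    STATE    = c("TX", "N/A", "Unknown"),
    AGE_YRS  = c(33, 73, 23)
  ),
  demo_path
)

# read_csv() decompresses .gz files on the fly. Column types and NA strings
# are set explicitly instead of relying on type guessing.
patients <- read_csv(
  demo_path,
  col_types = cols(
    VAERS_ID = col_character(),  # character keeps the leading zero of the ID
    STATE    = col_character(),
    AGE_YRS  = col_double()
  ),
  na = c("", " ", "NA", "N/A", "Unknown")
)
```

Reading VAERS_ID as character matters here: if it were guessed as a number, the leading zero would be lost and joins on the ID could become fragile.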

Methods: Dataset cleaning

Challenges and solutions

Patients dataset:

  • Unwanted dirty/uninformative columns → select(-c(CAGE_YR, CAGE_MO, RPT_DATE … ))
  • NAs that should be interpreted as "no" → replace_na(list(ALLERGIES = "N"))
  • Row duplications → distinct()
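
The three patient-cleaning steps above can be sketched as one pipeline. The toy tibble below is made up for illustration and only carries a few of the 35 real columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the patients table (values are illustrative only)
patients_raw <- tibble(
  VAERS_ID  = c("0916600", "0916600", "0916601"),
  AGE_YRS   = c(33, 33, 73),
  CAGE_YR   = c(33, 33, 73),
  ALLERGIES = c(NA, NA, "Penicillin")
)

patients_clean <- patients_raw %>%
  select(-CAGE_YR) %>%                   # drop a redundant column
  replace_na(list(ALLERGIES = "N")) %>%  # missing allergy info is read as "no"
  distinct()                             # remove exact duplicate rows
```

Note that replace_na() on a data frame takes a named list, one entry per column to fill.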

Vaccines dataset:

  • Contains non-COVID19 vaccines → filter(VAX_TYPE == "COVID19")
  • Contains vaccines of unknown manufacturer → filter(VAX_MANU != "UNKNOWN MANUFACTURER")
  • Row duplications → distinct()
  • Duplicated IDs → add_count(VAERS_ID) %>% filter(n == 1) %>% select(-n)
  • Inconsistent naming of vaccines → rename()
  • Redundant and dirty columns → select(-c(VAX_NAME, VAX_LOT))
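
As a rough illustration of the filtering and de-duplication steps for the vaccines table, the sketch below runs them on a made-up tibble. The rows, including the non-COVID entry, are invented for the example.

```r
library(dplyr)

# Toy stand-in for the vaccines table (values are illustrative only)
vaccines_raw <- tibble(
  VAERS_ID = c("0916600", "0916600", "0916601", "0916601",
               "0916602", "0916603", "0916604"),
  VAX_TYPE = c("COVID19", "COVID19", "COVID19", "COVID19",
               "FLU3", "COVID19", "COVID19"),
  VAX_MANU = c("MODERNA", "MODERNA", "MODERNA", "PFIZER-BIONTECH",
               "SANOFI PASTEUR", "UNKNOWN MANUFACTURER", "PFIZER-BIONTECH")
)

vaccines_clean <- vaccines_raw %>%
  filter(VAX_TYPE == "COVID19") %>%               # keep COVID-19 vaccines only
  filter(VAX_MANU != "UNKNOWN MANUFACTURER") %>%  # drop unknown manufacturers
  distinct() %>%                                  # collapse exact duplicate rows
  add_count(VAERS_ID) %>%                         # n = rows per patient ID
  filter(n == 1) %>%                              # drop IDs that still occur more than once
  select(-n)
```

The order matters: distinct() runs first so that a patient whose row was merely duplicated is kept, while a patient with two genuinely different vaccine records is removed.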

Symptoms dataset:

  • SYMPTOMVERSION1-5 columns are unnecessary → select(-c(SYMPTOMVERSION1, …, SYMPTOMVERSION5))

Methods: Data augmentation

Challenges and solutions

Patients data set:

  • Columns containing long string descriptions → Make tidy categorical (Y/N) variables
## # A tibble: 3 x 3
##   VAERS_ID OTHER_MEDS                     TAKES_ANTIINFLAMATORY
##   <chr>    <chr>                          <chr>                
## 1 0916983  <NA>                           N                    
## 2 0916988  Ibuprofen  PM the night before Y                    
## 3 0916996  Clobetasol, Benadryl           N
  • Dirty, redundant and uninformative columns → select(-c(ALLERGIES, OTHER_MEDS … ))
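
One way to derive a Y/N flag from a free-text column is pattern matching with stringr. The sketch below reproduces the three example rows shown above; the drug list in the pattern is a hypothetical assumption, since the slides do not state which terms were actually matched.

```r
library(dplyr)
library(stringr)

patients <- tibble(
  VAERS_ID   = c("0916983", "0916988", "0916996"),
  OTHER_MEDS = c(NA, "Ibuprofen  PM the night before", "Clobetasol, Benadryl")
)

# Hypothetical drug list, for illustration only
antiinflammatory_pattern <- regex("ibuprofen|aspirin|naproxen", ignore_case = TRUE)

patients_flagged <- patients %>%
  mutate(
    TAKES_ANTIINFLAMATORY = case_when(
      str_detect(OTHER_MEDS, antiinflammatory_pattern) ~ "Y",
      TRUE ~ "N"  # NA and non-matching free text both fall through to "no"
    )
  )
```

Treating a missing OTHER_MEDS entry as "N" mirrors the first example row, but it is a modelling choice: a blank field may also mean the information was simply not reported.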

Symptoms data set:

  • Too many distinct symptoms, recorded untidily → extract the 20 most frequent symptoms and turn each into a tidy logical (TRUE/FALSE) column
  • Calculate total number of symptoms per patient → mutate() to add column (N_SYMPTOMS)
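
The symptom augmentation can be sketched as follows, using the top 2 symptoms of a toy table instead of the top 20. Counting N_SYMPTOMS over all reported symptoms (rather than only the top ones) is an assumption of this sketch.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the cleaned symptoms table (values are illustrative only)
symptoms_raw <- tibble(
  VAERS_ID = c("0916600", "0916601", "0916602"),
  SYMPTOM1 = c("Headache", "Headache", "Chills"),
  SYMPTOM2 = c("Pyrexia", NA, "Headache"),
  SYMPTOM3 = c(NA, NA, "Pyrexia")
)

# One row per patient and reported symptom
symptoms_long <- symptoms_raw %>%
  pivot_longer(-VAERS_ID, names_to = "SLOT", values_to = "SYMPTOM",
               values_drop_na = TRUE) %>%
  select(-SLOT) %>%
  distinct()

n_top <- 2  # the slides use the top 20
top_symptoms <- symptoms_long %>%
  count(SYMPTOM, sort = TRUE) %>%
  head(n_top) %>%
  pull(SYMPTOM)

# Total number of symptoms per patient
n_symptoms <- symptoms_long %>%
  count(VAERS_ID, name = "N_SYMPTOMS")

# One logical column per top symptom; patients with none of the
# top symptoms are dropped by this sketch
symptoms_wide <- symptoms_long %>%
  filter(SYMPTOM %in% top_symptoms) %>%
  mutate(SYMPTOM = toupper(SYMPTOM), present = TRUE) %>%
  pivot_wider(names_from = SYMPTOM, values_from = present,
              values_fill = list(present = FALSE)) %>%
  inner_join(n_symptoms, by = "VAERS_ID")
```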

Methods: Data augmentation

Merging datasets

  • For visualizing, we need the wide format → inner_join(by = "VAERS_ID")
  • For modelling, symptoms must be in long format → pivot_longer() to create:
    • SYMPTOM column: top 20 symptom names
    • SYMPTOM_VALUE column: TRUE/FALSE
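
A small sketch of the merge and the wide-to-long reshaping, with two symptoms standing in for the top 20. The three toy tibbles are invented for the example.

```r
library(dplyr)
library(tidyr)

patients <- tibble(VAERS_ID = c("0916600", "0916601"), SEX = c("F", "M"))
vaccines <- tibble(VAERS_ID = c("0916600", "0916601"),
                   VAX_MANU = c("MODERNA", "PFIZER-BIONTECH"))
symptoms <- tibble(VAERS_ID = c("0916600", "0916601"),
                   HEADACHE = c(TRUE, FALSE),
                   PYREXIA  = c(FALSE, TRUE))

# Wide format: one row per patient, one column per symptom
merged_wide <- patients %>%
  inner_join(vaccines, by = "VAERS_ID") %>%
  inner_join(symptoms, by = "VAERS_ID")

# Long format: one row per patient and symptom
merged_long <- merged_wide %>%
  pivot_longer(c(HEADACHE, PYREXIA),
               names_to = "SYMPTOM", values_to = "SYMPTOM_VALUE")
```

inner_join() keeps only patients present in all three tables, so anything removed during cleaning of one table also disappears from the merged data.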

Methods: Analysis

Exploratory data analysis

  • Visualizations with ggplot()
  • Reduction of dimensionality (Principal Component Analysis) with prcomp()

Modelling and statistics

  • Logistic regression models with glm()
  • Tests of proportions (chi-squared contingency table tests) with chisq.test()

Exploratory Data Analysis

Visualization

Visualization

Age, sex and manufacturer distribution

## # A tibble: 3 x 2
##   SEX       n
##   <chr> <int>
## 1 F     24070
## 2 M      8514
## 3 <NA>    828

## # A tibble: 3 x 2
##   VAX_MANU            n
##   <chr>           <int>
## 1 JANSSEN          1106
## 2 MODERNA         16253
## 3 PFIZER-BIONTECH 16053

Visualization

Days until onset of symptoms vs. Age Group

Hypothesis: two peaks corresponding to the innate and acquired immune responses

Visualization

Age/sex vs. number of symptoms

Visualization

Vaccine manufacturer vs. number of symptoms

Visualization

Age vs. types of symptoms

Visualization

Sex vs. types of symptoms

Visualization

Vaccine manufacturer vs. types of symptoms

Exploratory Data Analysis

Principal Component Analysis

PCA

Important tools used

Important verbs and tools used:

  • prcomp()
  • augment()
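
A minimal sketch of how prcomp() and broom's augment() fit together. The slides do not state which variables went into the PCA, so the simulated 0/1 symptom columns below are purely an assumption.

```r
library(dplyr)
library(broom)

set.seed(1)
# Simulated symptom indicators (illustrative only)
symptom_matrix <- tibble(
  HEADACHE = rbinom(50, 1, 0.5),
  PYREXIA  = rbinom(50, 1, 0.4),
  CHILLS   = rbinom(50, 1, 0.3)
)

pca_fit <- prcomp(symptom_matrix, center = TRUE, scale. = TRUE)

# Scores per observation, added as .fittedPC1, .fittedPC2, ... (for a PCA plot)
pca_scores <- augment(pca_fit, data = symptom_matrix)

# Variance explained per component (for a scree plot)
pca_variance <- tidy(pca_fit, matrix = "eigenvalues")

# Loadings of each variable on each component (the rotation matrix)
pca_rotation <- tidy(pca_fit, matrix = "rotation")
```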

PCA

PCA plot and rotation matrix

PCA

Scree plot

Modelling

Logistic Regressions

Logistic Regression

Death ~ Patient Profile

Is the patient’s profile (sex, age, allergic/not, ill/not, has/had covid/not) correlated with death?

## # A tibble: 7 x 6
##   term           estimate std.error statistic  p.value odds_ratio
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)    -9.39      0.161    -58.2    0         0.0000832
## 2 SEXM            0.929     0.0573    16.2    4.00e-59  2.53     
## 3 AGE_YRS         0.0914    0.00207   44.1    0         1.10     
## 4 HAS_ALLERGIESY -0.0204    0.0605    -0.338  7.35e- 1  0.980    
## 5 HAS_ILLNESSY    1.08      0.0654    16.4    8.86e-61  2.93     
## 6 HAS_COVIDY     -0.113     0.142     -0.794  4.27e- 1  0.893    
## 7 HAD_COVIDY     -0.00375   0.195     -0.0193 9.85e- 1  0.996
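
The odds_ratio column in the table above is consistent with exponentiating the estimate, for example exp(0.929) ≈ 2.53 for SEXM. A sketch of how such a table can be produced with glm() and broom::tidy() follows; the data are simulated and the model uses only two of the predictors, so the numbers will not match the slide.

```r
library(dplyr)
library(broom)

set.seed(42)
n_patients <- 500

# Simulated patients (illustrative only): risk rises with age and is higher for males
patients <- tibble(
  SEX     = sample(c("F", "M"), n_patients, replace = TRUE),
  AGE_YRS = round(runif(n_patients, 18, 95))
) %>%
  mutate(
    p_death = plogis(-6 + 0.06 * AGE_YRS + 0.8 * (SEX == "M")),
    DIED    = rbinom(n_patients, 1, p_death)
  )

death_fit <- glm(DIED ~ SEX + AGE_YRS, data = patients,
                 family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exp() turns them into odds ratios
death_terms <- tidy(death_fit) %>%
  mutate(odds_ratio = exp(estimate))
```

An odds ratio above 1 means higher odds of death for that level or per unit increase. Since VAERS contains only reported adverse events, such estimates describe associations within the reports rather than effects in the vaccinated population.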

Logistic Regression

Death ~ Symptoms

Are some symptoms correlated with death?

## # A tibble: 20 x 6
##   term          estimate std.error statistic  p.value odds_ratio
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)     -2.01     0.0287    -70.1  0             0.134
## 2 HEADACHETRUE    -1.67     0.156     -10.7  7.92e-27      0.188
## 3 PYREXIATRUE     -0.429    0.112      -3.82 1.34e- 4      0.651
## 4 CHILLSTRUE      -1.21     0.171      -7.11 1.17e-12      0.298
## 5 FATIGUETRUE     -0.367    0.115      -3.19 1.41e- 3      0.693
## 6 PAINTRUE        -0.913    0.153      -5.98 2.17e- 9      0.401
## 7 NAUSEATRUE      -0.621    0.139      -4.46 8.17e- 6      0.538
## 8 DIZZINESSTRUE   -2.17     0.193     -11.2  2.87e-29      0.114
## # … with 12 more rows

Many Logistic Regressions

Each Symptom ~ Takes Anti-Inflammatory

Does taking anti-inflammatories modify the chance of having symptoms?

## # A tibble: 20 x 9
##   SYMPTOM  estimate std.error statistic p.value conf.low conf.high odds_ratio
##   <chr>       <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
## 1 HEADACHE  -0.164     0.0987    -1.67   0.0958   -0.362    0.0255      0.848
## 2 PYREXIA    0.0152    0.102      0.150  0.881    -0.189    0.211       1.02 
## 3 CHILLS    -0.121     0.109     -1.11   0.266    -0.340    0.0875      0.886
## 4 FATIGUE    0.0565    0.105      0.539  0.590    -0.154    0.258       1.06 
## 5 PAIN       0.0113    0.110      0.102  0.919    -0.210    0.222       1.01 
## # … with 15 more rows, and 1 more variable: identified_as <chr>
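
Fitting one model per symptom is a natural use of nest() and map() on the long-format data. The sketch below uses simulated data with two symptoms; the column names follow the slides, everything else (sample size, random values) is an assumption, and the identified_as column from the table is not reproduced.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(7)
n_patients <- 300

# Simulated long-format data (illustrative only)
long_data <- tibble(
  VAERS_ID = rep(seq_len(n_patients), times = 2),
  TAKES_ANTIINFLAMATORY = rep(sample(c("Y", "N"), n_patients, replace = TRUE), times = 2),
  SYMPTOM = rep(c("HEADACHE", "PYREXIA"), each = n_patients),
  SYMPTOM_VALUE = sample(c(TRUE, FALSE), 2 * n_patients, replace = TRUE)
)

symptom_models <- long_data %>%
  group_by(SYMPTOM) %>%
  nest() %>%                                   # one row per symptom, data in a list column
  mutate(
    fit = map(data, ~ glm(SYMPTOM_VALUE ~ TAKES_ANTIINFLAMATORY,
                          data = .x, family = binomial())),
    tidied = map(fit, tidy, conf.int = TRUE)   # coefficients with confidence intervals
  ) %>%
  ungroup() %>%
  unnest(tidied) %>%
  filter(term == "TAKES_ANTIINFLAMATORYY") %>% # keep the effect of taking the drug
  mutate(odds_ratio = exp(estimate)) %>%
  select(SYMPTOM, estimate, std.error, statistic, p.value,
         conf.low, conf.high, odds_ratio)
```

With 20 models, 20 p-values are produced, so some correction for multiple testing is worth considering before calling any single symptom significant.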

04_analysis_tests

Chi-squared contingency table tests

DIED   JANSSEN   MODERNA   PFIZER-BIONTECH
N         1090     15281             15212
Y           16       972               841

DIED       F      M
N      23271   7523
Y        799    991
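
The two contingency tables above can be passed to chisq.test() directly as matrices. This is a sketch of that call using the counts shown; how the authors built the tables from the data is not stated.

```r
# Counts copied from the two contingency tables above
died_by_manu <- matrix(
  c(1090, 15281, 15212,
      16,   972,   841),
  nrow = 2, byrow = TRUE,
  dimnames = list(DIED = c("N", "Y"),
                  VAX_MANU = c("JANSSEN", "MODERNA", "PFIZER-BIONTECH"))
)

died_by_sex <- matrix(
  c(23271, 7523,
      799,  991),
  nrow = 2, byrow = TRUE,
  dimnames = list(DIED = c("N", "Y"), SEX = c("F", "M"))
)

# Test of independence between death and manufacturer (2 x 3 table)
manu_test <- chisq.test(died_by_manu)

# Test of independence between death and sex (2 x 2 table;
# chisq.test() applies Yates' continuity correction by default here)
sex_test <- chisq.test(died_by_sex)
```

The test statistic, degrees of freedom and p-value are available as manu_test$statistic, manu_test$parameter and manu_test$p.value, and the expected counts under independence as manu_test$expected.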

Conclusion and discussion
